Exploratory data analysis

for collagen diseases dataset

Author
Affiliation
Xiaojing Ni

Georgetown University

Published

February 20, 2023

The code and data can be found in the GitHub repo: https://github.com/XiaojingNi/collagen_diseases_eda

1 Data summary

In order to explore the mechanisms of collagen diseases, a dataset was extracted from a university hospital database. The overall objective is to study potential factors that help detect and predict thrombosis, one of the severe complications caused by collagen diseases. The dataset contains three parts: patients' information and diagnoses, examination results for patients suspected of thrombosis, and general laboratory examination results.

Examinations related to thrombosis are mainly blood tests. Thus, the dataset describing examination results for patients suspected of thrombosis contains anti-Cardiolipin antibody measurements and coagulation measurements (coagulation is the process of blood changing to a solid or semi-solid state), along with the degree of thrombosis. The general laboratory examination results include routine blood tests, such as red blood cell count, blood glucose, and total bilirubin. Those tests are not necessarily related to a thrombosis diagnosis and can happen anytime the doctors think the patient needs them. The three datasets are connected by patient ID in one-to-many relations: one patient can have multiple tests on the same or different dates, and a patient can also have one or more diagnoses.
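The one-to-many relation keyed on patient ID can be illustrated with a toy sketch (made-up IDs and values, not records from this dataset): each lab row picks up its patient's attributes in a left join.

```python
import pandas as pd

# one row per patient in the patient table ...
patients = pd.DataFrame({"ID": [1, 2], "SEX": ["F", "M"]})
# ... many rows per patient in the lab table
labs = pd.DataFrame({"ID": [1, 1, 2],
                     "Date": ["1996-01-01", "1996-02-01", "1997-03-05"],
                     "RBC": [410, 398, 455]})

# left join keyed on patient ID keeps every lab row
joined = labs.merge(patients, on="ID", how="left")
print(joined["SEX"].tolist())  # ['F', 'F', 'M']
```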

2 Initial questions

The initial questions are listed below.
  • What symptoms lead a doctor to suspect that a patient has thrombosis?
  • Are some of the measurements correlated with each other?
  • Do some of the symptoms always occur together?

3 Data munging

#| echo: false
#| warning: false
### this cell is used to run Python in qmd so that one notebook can run both R and Python
library(reticulate)
use_python("/Users/xiaojingni/miniforge3/envs/anly503/bin/python")

3.1 TSUMOTO_A

3.1.1 load patient data

Selected feature meanings:
  • Description: the date when a patient was input
  • First date: the date when a patient came to the hospital
  • Admission: patient was admitted to the hospital or followed at the outpatient clinic
  • Diagnosis: some patients may suffer from several diseases
# imports used throughout this notebook
import collections
import re

import numpy as np
import pandas as pd
from dateutil import relativedelta

# read data
patients_df_raw = pd.read_csv(
    "../data/TSUMOTO_A.csv", encoding='windows-1252')

3.1.2 Missing values and duplicates

# basic info of the raw data
patients_df_raw
patients_df_raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1240 entries, 0 to 1239
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   ID           1240 non-null   int64 
 1   SEX          1227 non-null   object
 2   Birthday     1239 non-null   object
 3   Description  1023 non-null   object
 4   First Date   992 non-null    object
 5   Admission    1209 non-null   object
 6   Diagnosis    1240 non-null   object
dtypes: int64(1), object(6)
memory usage: 67.9+ KB

There are some missing values, shown below. All records have an ID and a Diagnosis. Birthday has one missing value, and SEX has 13. There are 248 patients without the "First Date" information (the date when a patient came to the hospital).

patients_df_raw.isna().sum()
ID               0
SEX             13
Birthday         1
Description    217
First Date     248
Admission       31
Diagnosis        0
dtype: int64

Check whether the ID column has duplicate values

# check if ID have duplicate value
if len(patients_df_raw) != len(set(patients_df_raw.ID)):
    print("duplicates found in the list")
else:
    print("No duplicates found in the list")
duplicates found in the list

List duplicate IDs and their records

# below code is revised from https://www.trainingint.com/how-to-find-duplicates-in-a-python-list.html

id_set = set()
dupset = set()
for i in patients_df_raw.ID:
    if i not in id_set:
        id_set.add(i)
    else:
        # already seen once: record the ID as a duplicate
        dupset.add(i)
# print the records belonging to duplicate IDs
print("List of duplicates\n",
      patients_df_raw.loc[patients_df_raw['ID'].isin(list(dupset))])
List of duplicates
            ID SEX   Birthday Description First Date Admission Diagnosis
264   2557319   F  1959/9/27         NaN        NaN         +       SLE
265   2557319   F  1984/9/27    98.07.28   91.11.05         +       SLE
1214  5807039   F  1972/11/8    98.06.22   98.06.08         -       APS
1215  5807039   F  1972/11/8    98.07.28   98.06.08         -       APS

For patient 2557319, only one record has full data, so the duplicate is probably a bad entry; deleting the incomplete record solves the problem. For patient 5807039, the first date is the same but the entry (Description) date is not, so I keep the record with the earlier entry date.

patients_df_raw = patients_df_raw.drop(labels=[264, 1215], axis=0)

# sanity check
# check if ID have duplicate value
if len(patients_df_raw) != len(set(patients_df_raw.ID)):
    print("duplicates found in the list")
else:
    print("No duplicates found in the list")
No duplicates found in the list
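Dropping rows 264 and 1215 by index works here because there are only two known duplicates. A sketch of a more general rule (a hypothetical helper, assuming we keep the most complete record per ID and break ties by the earliest Description date):

```python
import pandas as pd

def dedupe_patients(df):
    """Keep, per ID, the row with the fewest missing fields;
    break ties by the earliest 'Description' entry date."""
    df = df.copy()
    df["_n_missing"] = df.isna().sum(axis=1)          # completeness score
    df = df.sort_values(["_n_missing", "Description"])  # most complete, earliest first
    return (df.drop_duplicates(subset="ID", keep="first")
              .drop(columns="_n_missing"))

# tiny made-up example: ID 1 has a complete and an incomplete record
demo = pd.DataFrame({
    "ID": [1, 1, 2],
    "Description": ["98.07.28", None, "98.06.22"],
    "Diagnosis": ["SLE", "SLE", "APS"],
})
print(len(dedupe_patients(demo)))  # 2
```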

3.1.3 Cleaning and data format

Some of the date entries contain typos (stray letters); replacing those letters with "" solves this. Below I create three new columns holding reformatted versions of the date columns "Birthday", "Description", and "First Date".

# reformat
# Birthday
patients_df_raw['b-day'] = pd.to_datetime(patients_df_raw['Birthday'])

# Description day
patients_df_raw['d-day'] = pd.to_datetime(patients_df_raw['Description'])

# First Date, remove typo letters
patients_df_raw['f-day'] = pd.to_datetime(
    patients_df_raw['First Date'].replace(r"[A-Z]", "", regex=True))
# f-day (first hospital visit) and d-day (entry date) must be later than b-day;
# if not, delete those entries

patients_df_raw = patients_df_raw[(patients_df_raw['f-day']
                                  > patients_df_raw['b-day']) & (patients_df_raw['d-day'] > patients_df_raw['b-day'])]
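Note that comparisons against `NaT` evaluate to `False`, so this filter also silently drops any row whose dates failed to parse, not only rows with impossible date orderings. A quick check of that pandas behavior:

```python
import pandas as pd

# NaT never compares greater (or less) than a real timestamp,
# so boolean date filters drop unparsed dates along with invalid ones
print(pd.NaT > pd.Timestamp("1950-01-01"))   # False
print(pd.Timestamp("1994-02-14") > pd.NaT)   # False
```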

3.1.4 New variables

In this section, the new columns age_of_first_come and Diagnosis1 to Diagnosis4 are created. age_of_first_come is based on the difference between the Description date and the birthday, and the Diagnosis columns are generated by parsing the Diagnosis column.

patients_df_raw['age_of_first_come'] = patients_df_raw.apply(
    lambda row: relativedelta.relativedelta(row['d-day'], row['b-day']).years, axis=1)
# sanity check
patients_df_raw.head(5)
ID SEX Birthday Description First Date Admission Diagnosis b-day d-day f-day age_of_first_come
0 2110 F 1934/2/13 94.02.14 93.02.10 + RA susp. 1934-02-13 1994-02-14 1993-02-10 60
1 11408 F 1937/5/2 96.12.01 73.01.01 + PSS 1937-05-02 1996-12-01 1973-01-01 59
6 43003 M 1937/11/24 94.03.08 94.03.08 - Raynaud's phenomenon 1937-11-24 1994-03-08 1994-03-08 56
9 57266 M 1923/7/25 97.02.03 93.02.05 + RA 1923-07-25 1997-02-03 1993-02-05 73
11 71417 F 1929/10/20 96.12.03 95.06.20 - RA 1929-10-20 1996-12-03 1995-06-20 67
# parsing diagnosis
diagnosis_terms = patients_df_raw['Diagnosis'].str.split(", ")
d = collections.Counter()
for phrases in diagnosis_terms:
    for phrase in phrases:
        d[phrase] += 1

sorted(d.items(), key=lambda x: x[1], reverse=True)
[('SJS', 220),
 ('SLE', 186),
 ('RA', 157),
 ('BEHCET', 62),
 ('PSS', 55),
 ('MCTD', 37),
 ('SJS susp', 36),
 ('APS', 35),
 ('PM', 30),
 ('RA susp', 22),
 ('SLE susp', 21),
 ('FUO', 15),
 ('DM', 14),
 ('PN', 10),
 ('MRA', 9),
 ('AORTITIS', 8),
 ('Raynaud', 7),
 ('DM (amyopathic)', 7),
 ('RA (seronegative)', 6),
 ('APS susp', 5),
 ('Adult STILL', 5),
 ('collagen susp', 5),
 ('PSS susp', 4),
 ('BEHCET susp', 4),
 ('Scleroderma', 4),
 ('PNc', 3),
 ('PMR susp', 3),
 ('PMR', 3),
 ('WG', 3),
 ('relapsingü@polychondritis', 2),
 ('AIHA', 2),
 ('RA,SJS', 2),
 ('PM susp', 2),
 ('relapsing polychondritis', 2),
 ('JRA', 2),
 ('UC', 2),
 ('AGA', 2),
 ('ITP', 2),
 ('Sarcoidosis', 2),
 ('PM/DM', 2),
 ('Sarcoidosis susp', 2),
 ('MCTD susp', 2),
 ('RA susp.', 1),
 ("Raynaud's phenomenon", 1),
 ('BEHCET (òsæSî^)', 1),
 ('PN (vasculitis)', 1),
 ('Sweet', 1),
 ('PSS(CREST)', 1),
 ('Hyper CPK', 1),
 ('BEHCET,neuro', 1),
 ('BEHCET-vasculo', 1),
 ('RA\x1dSJS\x1dPM', 1),
 ('Adie', 1),
 ('Vasculitis susp', 1),
 ('Vasculitis', 1),
 ('DM susp', 1),
 ('BEHCETüineuroüj', 1),
 ('PBC', 1),
 ('Psoriatic Arthritis', 1),
 ("Reiter's syndrome susp", 1),
 ('Chronic EBV', 1),
 ('BEHCET (vasculo)', 1),
 ('BEHCET(entero)', 1),
 ('drug induced hepatitis', 1),
 ('ANAPHYLACTOID PURPURA NEPHRITIS', 1),
 ('McArdle', 1),
 ('Psoriatic arthritis', 1),
 ('ASO ', 1),
 ('APS\x1dAPS', 1),
 ('PN susp', 1),
 ('PSS,SJS', 1),
 ('AORTITIS susp', 1),
 ('Myasthenia G', 1),
 ('pemphygus', 1),
 ('IP', 1),
 ('ìéCPKîîÅÌ', 1),
 ('PSS(sclerodermatomyositis)', 1),
 ('CREST', 1),
 ('ANA', 1),
 ('SJS\x1dMCTD', 1),
 ('lupoid hepatitis', 1),
 ('collagen disease susp', 1),
 ('RA(seronegative)', 1),
 ('Angioedema with eosinophilia', 1),
 ('arthralgia', 1),
 ('DLE', 1),
 ('FUO (STILL susp.)', 1),
 ('BEHCET (entero)', 1),
 ('CREST synd susp', 1),
 ('generalized morphea', 1),
 ('PNc susp', 1),
 ('ANCA', 1),
 ('EN', 1),
 ('Chrnoic Palvo V', 1),
 ('WG\x1dSJS', 1),
 ('SJS susp\x0b\x1dSJS', 1),
 ('Chronic EB virus infection susp', 1),
 ('PSS susp\x1dPSS', 1),
 ('Chronic EB', 1),
 ('Henoch-Schoelein purpura', 1),
 ('Morphia', 1),
 ('Focal Myositis', 1),
 ('Vasculitis synd. susp', 1),
 ('Weber-Christian', 1)]
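As an aside, `collections.Counter` has a built-in `most_common` method that returns the same (term, count) pairs in descending count order, so the `sorted(...)` call above could be written more idiomatically:

```python
import collections

# counts copied from the top of the output above
d = collections.Counter({"SJS": 220, "SLE": 186, "RA": 157})
print(d.most_common(2))  # [('SJS', 220), ('SLE', 186)]
```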
diag = patients_df_raw['Diagnosis'].str.split(", ", expand=True)
diag.columns = ["Diagnosis"+str(i) for i in range(1, 5)]
patients_df = pd.concat([patients_df_raw, diag], axis=1)
patients_df.head(5)
ID SEX Birthday Description First Date Admission Diagnosis b-day d-day f-day age_of_first_come Diagnosis1 Diagnosis2 Diagnosis3 Diagnosis4
0 2110 F 1934/2/13 94.02.14 93.02.10 + RA susp. 1934-02-13 1994-02-14 1993-02-10 60 RA susp. None None None
1 11408 F 1937/5/2 96.12.01 73.01.01 + PSS 1937-05-02 1996-12-01 1973-01-01 59 PSS None None None
6 43003 M 1937/11/24 94.03.08 94.03.08 - Raynaud's phenomenon 1937-11-24 1994-03-08 1994-03-08 56 Raynaud's phenomenon None None None
9 57266 M 1923/7/25 97.02.03 93.02.05 + RA 1923-07-25 1997-02-03 1993-02-05 73 RA None None None
11 71417 F 1929/10/20 96.12.03 95.06.20 - RA 1929-10-20 1996-12-03 1995-06-20 67 RA None None None
# save to local
patients_df[["ID", "SEX", "b-day", "d-day", "f-day", "age_of_first_come", "Admission", "Diagnosis1", "Diagnosis2", "Diagnosis3",
             "Diagnosis4"]].to_csv('../data/tsumotoa_clean.csv', index=False)

3.2 TSUMOTO_B

3.2.1 load data

# read data
slab_df_raw = pd.read_csv("../data/TSUMOTO_B.csv", encoding='windows-1252')
# examine the data
slab_df_raw.head(5)
slab_df_raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 806 entries, 0 to 805
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   ID                770 non-null    float64
 1   Examination Date  796 non-null    object 
 2   aCL IgG           806 non-null    float64
 3   aCL IgM           806 non-null    float64
 4   ANA               768 non-null    object 
 5   ANA Pattern       545 non-null    object 
 6   aCL IgA           806 non-null    float64
 7   Diagnosis         475 non-null    object 
 8   KCT               146 non-null    object 
 9   RVVT              146 non-null    object 
 10  LAC               222 non-null    object 
 11  Symptoms          80 non-null     object 
 12  Thrombosis        806 non-null    int64  
dtypes: float64(4), int64(1), object(8)
memory usage: 82.0+ KB
# values of different columns
for nm in ["KCT", "RVVT", "LAC", "ANA", "ANA Pattern"]:
    print(slab_df_raw[nm].value_counts())
    print("-------------------")
-    105
+     41
Name: KCT, dtype: int64
-------------------
-    99
+    47
Name: RVVT, dtype: int64
-------------------
-    147
+     75
Name: LAC, dtype: int64
-------------------
0        206
16       155
64       112
4        108
256       82
1024      58
4096      45
4094       1
>4096      1
Name: ANA, dtype: int64
-------------------
S        252
P        178
P,S       89
S,P       11
D,P,S      2
S,D        2
D,P        2
p          2
P.S        1
P,D        1
S,N        1
D,S        1
D          1
N          1
P.D        1
Name: ANA Pattern, dtype: int64
-------------------
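The `ANA Pattern` counts above suggest the same combination is encoded inconsistently (`P,S` vs `S,P`, `P.S`, lowercase `p`). A hedged normalization sketch, assuming `,` and `.` are equivalent separators and that component order carries no meaning:

```python
import pandas as pd

def normalize_pattern(val):
    """Upper-case, split on ',' or '.', sort the components, rejoin."""
    if pd.isna(val):
        return val
    parts = sorted(p.strip().upper() for p in val.replace(".", ",").split(","))
    return ",".join(parts)

print(normalize_pattern("S,P"))  # 'P,S'
print(normalize_pattern("P.S"))  # 'P,S'
print(normalize_pattern("p"))    # 'P'
```

After this mapping, variants such as `S,P`, `P.S`, and `P,S` collapse into one category before counting.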

3.2.2 Missing values and duplicates

The dataset contains missing values in several columns, so I will use different strategies to handle them depending on the purpose. This dataset allows duplicated IDs (one patient can have several examinations), so here I only check for fully duplicated entries as a data-quality check.
Joining with the other datasets requires the ID column, so for join analysis, records without ID information are removed.

# check for fully duplicated rows (result is displayed, not assigned)
slab_df_raw.drop_duplicates()
ID Examination Date aCL IgG aCL IgM ANA ANA Pattern aCL IgA Diagnosis KCT RVVT LAC Symptoms Thrombosis
0 14872.0 1997/5/27 1.3 1.6 256 P 0.0 MCTD, AMI NaN NaN - AMI 1
1 48473.0 1992/12/21 4.3 4.6 256 P,S 3.3 SLE - - - NaN 0
2 102490.0 1995/4/20 2.3 2.5 0 NaN 3.5 PSS NaN NaN NaN NaN 0
3 108788.0 1997/5/6 0.0 0.0 16 S 0.0 NaN NaN NaN - NaN 0
4 122405.0 1998/4/2 0.0 4.0 4 P 0.0 SLE, SjS, vertigo NaN NaN NaN NaN 0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
801 NaN 1996/11/21 3.3 1.7 NaN NaN 0.0 NaN NaN NaN NaN NaN 0
802 NaN 1995/10/2 1.2 3.0 4 NaN 1.8 NaN NaN NaN NaN NaN 0
803 NaN 1997/7/14 0.0 2.2 NaN NaN 0.0 PSS NaN NaN NaN NaN 0
804 NaN NaN 0.0 19.8 4096 S 0.0 SLE+PH NaN NaN NaN PH 1
805 NaN 1998/1/16 0.0 2.0 16 P 0.0 NaN NaN NaN NaN NaN 0

806 rows × 13 columns

# For join dataset
join_B = slab_df_raw.dropna(subset=['ID'])
join_B
            ID Examination Date  aCL IgG  aCL IgM  ANA ANA Pattern  aCL IgA  \
0      14872.0        1997/5/27      1.3      1.6  256           P      0.0   
1      48473.0       1992/12/21      4.3      4.6  256         P,S      3.3   
2     102490.0        1995/4/20      2.3      2.5    0         NaN      3.5   
3     108788.0         1997/5/6      0.0      0.0   16           S      0.0   
4     122405.0         1998/4/2      0.0      4.0    4           P      0.0   
..         ...              ...      ...      ...  ...         ...      ...   
765  5771963.0        1998/3/13      0.0      1.5   64           P      0.0   
766  5775402.0        1998/3/23      0.0     15.6  256         P,S      7.0   
767  5775806.0        1998/3/24      0.0      0.9   16           P      0.0   
768  5779550.0        1998/3/31      0.0      2.5    4           S      0.0   
769  9334041.0         1995/6/8      3.0      2.4    0         NaN      3.9   

             Diagnosis  KCT RVVT  LAC Symptoms  Thrombosis  
0            MCTD, AMI  NaN  NaN    -      AMI           1  
1                  SLE    -    -    -      NaN           0  
2                  PSS  NaN  NaN  NaN      NaN           0  
3                  NaN  NaN  NaN    -      NaN           0  
4    SLE, SjS, vertigo  NaN  NaN  NaN      NaN           0  
..                 ...  ...  ...  ...      ...         ...  
765               MCTD  NaN  NaN  NaN      NaN           0  
766                PMR  NaN  NaN  NaN      NaN           0  
767                NaN  NaN  NaN  NaN      NaN           0  
768           SLE susp  NaN  NaN  NaN      NaN           0  
769                NaN  NaN  NaN  NaN      NaN           0  

[770 rows x 13 columns]
# save to local
join_B.to_csv('../data/tsumotob_clean.csv', index=False)

3.2.3 Cleaning and data format

3.2.3.1 Parsing Diagnosis column

The Diagnosis column contains one or more diagnoses. This section parses the column into keywords for further analysis. For thrombosis-positive patients, the text is first split on special characters (including spaces), and then the 10 most frequent words are computed. The same procedure is repeated for thrombosis-negative patients.

# Thrombosis positive words
positive_words = [re.split(r'[^a-zA-Z0-9/s]+', text) for text, label in
                  zip(slab_df_raw['Diagnosis'], slab_df_raw['Thrombosis']) if label >= 1 and type(text) != float]

positive_dict = collections.Counter()
for phrases in positive_words:
    positive_dict += collections.Counter(phrases)

# clean up keywords
if "" in positive_dict:
    positive_dict.pop("")

positive_top10 = sorted(positive_dict.items(),
                        key=lambda x: x[1], reverse=True)[:10]
positive_top10
[('SLE', 44),
 ('APS', 17),
 ('SjS', 11),
 ('RA', 7),
 ('infarction', 6),
 ('MCTD', 5),
 ('brain', 5),
 ('CNS', 3),
 ('AMI', 2),
 ('abortion', 2)]
# Thrombosis negative words
negative_words = [re.split(r'[^a-zA-Z0-9/s]+', text) for text, label in
                  zip(slab_df_raw['Diagnosis'], slab_df_raw['Thrombosis']) if label == 0 and type(text) != float]

negative_dict = collections.Counter()
for phrases in negative_words:
    negative_dict += collections.Counter(phrases)

# clean up keywords
if "" in negative_dict:
    negative_dict.pop("")

negative_top10 = sorted(negative_dict.items(),
                        key=lambda x: x[1], reverse=True)[:10]
negative_top10
[('SLE', 142),
 ('SjS', 105),
 ('RA', 52),
 ('susp', 42),
 ('MCTD', 23),
 ('PSS', 21),
 ('APS', 16),
 ('PM', 16),
 ('IP', 13),
 ('Behcet', 11)]
3.2.3.2 Data format

Here, I change the examination date to datetime format.

# work on an explicit copy to avoid pandas' SettingWithCopyWarning,
# since join_B was derived from slab_df_raw via dropna
join_B = join_B.copy()
join_B['Examination Date'] = pd.to_datetime(join_B['Examination Date'])

3.3 TSUMOTO_C

3.3.1 load data

# read data
clab_df_raw = pd.read_csv("../data/TSUMOTO_C.csv",
                          encoding='ISO-8859-1', on_bad_lines='skip')
# examine the data
clab_df_raw.head(5)
clab_df_raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57510 entries, 0 to 57509
Data columns (total 44 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   ID         57510 non-null  int64  
 1   Date       57510 non-null  int64  
 2   GOT        46169 non-null  object 
 3   GPT        46154 non-null  object 
 4   LDH        46308 non-null  object 
 5   ALP        45521 non-null  object 
 6   TP         45636 non-null  object 
 7   ALB        45321 non-null  object 
 8   UA         45308 non-null  object 
 9   UN         46123 non-null  float64
 10  CRE        46219 non-null  object 
 11  T-BIL      38777 non-null  object 
 12  T-CHO      42420 non-null  object 
 13  TG         29258 non-null  object 
 14  CPK        20564 non-null  object 
 15  GLU        8432 non-null   object 
 16  WBC        49605 non-null  object 
 17  RBC        49605 non-null  object 
 18  HGB        49606 non-null  object 
 19  HCT        49606 non-null  object 
 20  PLT        47905 non-null  object 
 21  PT         4003 non-null   float64
 22  APTT       892 non-null    float64
 23  FG         3383 non-null   object 
 24  PIC        799 non-null    object 
 25  TAT        1363 non-null   object 
 26  TAT2       992 non-null    object 
 27  U-PRO      39108 non-null  object 
 28  IGG        12329 non-null  object 
 29  IGA        12326 non-null  object 
 30  IGM        12321 non-null  object 
 31  CRP        47101 non-null  object 
 32  RA         8560 non-null   object 
 33  RF         12567 non-null  object 
 34  C3         26863 non-null  object 
 35  C4         26864 non-null  object 
 36  RNP        742 non-null    object 
 37  SM         655 non-null    object 
 38  SC170      138 non-null    object 
 39  SSA        575 non-null    object 
 40  SSB        544 non-null    object 
 41  CENTROMEA  67 non-null     object 
 42  DNA        432 non-null    float64
 43  DNA-II     0 non-null      float64
dtypes: float64(5), int64(2), object(37)
memory usage: 19.3+ MB

3.3.2 Missing values and duplicates

The dataset contains missing values in some of the columns. All records have the ID feature, so joining should work. The missing counts for the other columns are shown below. Missing values will be handled with different strategies in the EDA section.

# check for fully duplicated rows (result is displayed, not assigned)
clab_df_raw.drop_duplicates()
ID Date GOT GPT LDH ALP TP ALB UA UN ... C3 C4 RNP SM SC170 SSA SSB CENTROMEA DNA DNA-II
0 2110 860419 24 12 152 63 7.5 4.5 3.4 16.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2110 860430 25 12 162 76 7.9 4.6 4.7 18.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2110 860502 22 8 144 68 7 4.2 5 18.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2110 860506 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 2110 860507 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
57505 5834596 980730 34 61 199 213.0 7.9 3.5 4.9 10.9 ... NaN NaN negative negative NaN 64 negative NaN NaN NaN
57506 5834596 980820 34 36 177 278.0 9 4.1 4.1 17.9 ... 142 18 NaN NaN NaN NaN NaN NaN NaN NaN
57507 5834596 981015 27 20 153 181.0 8.4 3.8 4.6 14.7 ... 133 14 NaN NaN NaN NaN NaN NaN NaN NaN
57508 5834596 981217 26 19 143 171.0 8.5 3.9 4.2 13.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
57509 5845877 980828 32 21 300 305.0 7.9 3.5 4.2 17.5 ... 116 19 NaN NaN NaN NaN NaN NaN NaN NaN

57510 rows × 44 columns

clab_df_raw.isna().sum()
ID               0
Date             0
GOT          11341
GPT          11356
LDH          11202
ALP          11989
TP           11874
ALB          12189
UA           12202
UN           11387
CRE          11291
T-BIL        18733
T-CHO        15090
TG           28252
CPK          36946
GLU          49078
WBC           7905
RBC           7905
HGB           7904
HCT           7904
PLT           9605
PT           53507
APTT         56618
FG           54127
PIC          56711
TAT          56147
TAT2         56518
U-PRO        18402
IGG          45181
IGA          45184
IGM          45189
CRP          10409
RA           48950
RF           44943
C3           30647
C4           30646
RNP          56768
SM           56855
SC170        57372
SSA          56935
SSB          56966
CENTROMEA    57443
DNA          57078
DNA-II       57510
dtype: int64

3.3.3 Data format

Here, I change the date to datetime format.

# Date is stored as yymmdd integers (e.g. 860419), so parse with an explicit format
clab_df_raw['Date'] = pd.to_datetime(
    clab_df_raw['Date'].astype(str), format='%y%m%d')
# measurement columns to numeric: entries that are not numeric become NaN
numeric_cols = ['GOT', 'GPT', 'LDH', 'ALP', 'TP', 'ALB', 'UA', 'UN', 'CRE',
                'T-BIL', 'T-CHO', 'TG', 'CPK', 'GLU', 'WBC', 'RBC', 'HGB',
                'HCT', 'PLT', 'PT', 'APTT', 'FG', 'PIC', 'TAT', 'U-PRO',
                'IGG', 'IGA', 'IGM', 'RF', 'C3', 'C4', 'DNA', 'DNA-II']
clab_df_raw[numeric_cols] = clab_df_raw[numeric_cols].apply(
    pd.to_numeric, errors='coerce')

3.3.4 New columns

For each lab test, there are criteria indicating whether the measured value is out of the normal range. In this section, new variables are introduced based on those criteria. Note: tests whose normal range differs by sex are dealt with later.

clab_df = clab_df_raw.copy()
clab_df = clab_df.assign(is_GOT_normal=np.where(clab_df['GOT'] >= 60, 'no', np.where(clab_df['GOT'] < 60, 'yes', 'NaN')),  # is_GOT_normal N<60
               is_GPT_normal=np.where(clab_df['GPT'] >= 60, 'no', np.where(
                   clab_df['GPT'] < 60, 'yes', 'NaN')),  # is_GPT_normal N<60
               is_LDH_normal=np.where(clab_df['LDH'] >= 500, 'no', np.where(
                   clab_df['LDH'] < 500, 'yes', 'NaN')),  # is_LDH_normal N<500
               is_ALP_normal=np.where(clab_df['ALP'] >= 300, 'no', np.where(
                   clab_df['ALP'] < 300, 'yes', 'NaN')),  # is_ALP_normal N<300
               is_TP_normal=np.where(((clab_df['TP'] >= 8.5) | (clab_df['TP'] <= 6)), 'no', np.where(
                   ((clab_df['TP'] < 8.5) & (clab_df['TP'] > 6)), 'yes', 'NaN')),  # is_TP_normal 6<N<8.5
               is_ALB_normal=np.where(((clab_df['ALB'] >= 5.5) | (clab_df['ALB'] <= 3.5)), 'no', np.where(
                   ((clab_df['ALB'] < 5.5) & (clab_df['ALB'] > 3.5)), 'yes', 'NaN')),  # is_ALB_normal 3.5<N<5.5
               is_UN_normal=np.where(clab_df['UN'] >= 30, 'no', np.where(
                   clab_df['UN'] < 30, 'yes', 'NaN')),  # is_UN_normal N<30
               is_CRE_normal=np.where(clab_df['CRE'] >= 1.5, 'no', np.where(
                   clab_df['CRE'] < 1.5, 'yes', 'NaN')),  # is_CRE_normal N<1.5
               is_TBIL_normal=np.where(clab_df['T-BIL'] >= 2, 'no', np.where(
                   clab_df['T-BIL'] < 2, 'yes', 'NaN')),  # is_T-BIL_normal N<2
               is_TCHO_normal=np.where(clab_df['T-CHO'] >= 250, 'no', np.where(
                   clab_df['T-CHO'] < 250, 'yes', 'NaN')),  # is_T-CHO_normal N<250
               is_TG_normal=np.where(clab_df['TG'] >= 200, 'no', np.where(
                   clab_df['TG'] < 200, 'yes', 'NaN')),  # is_TG_normal N<200
               is_CPK_normal=np.where(clab_df['CPK'] >= 250, 'no', np.where(
                   clab_df['CPK'] < 250, 'yes', 'NaN')),  # is_CPK_normal N<250
               is_GLU_normal=np.where(clab_df['GLU'] >= 180, 'no', np.where(
                   clab_df['GLU'] < 180, 'yes', 'NaN')),  # is_GLU_normal N<180
               is_WBC_normal=np.where(((clab_df['WBC'] >= 9000) | (clab_df['WBC'] <= 3500)), 'no', np.where(
                   ((clab_df['WBC'] < 9000) & (clab_df['WBC'] > 3500)), 'yes', 'NaN')),  # is_WBC_normal 3500<N<9000
               is_RBC_normal=np.where(((clab_df['RBC'] >= 600) | (clab_df['RBC'] <= 350)), 'no', np.where(
                   ((clab_df['RBC'] < 600) & (clab_df['RBC'] > 350)), 'yes', 'NaN')),  # is_RBC_normal 350<N<600
               is_HGB_normal=np.where(((clab_df['HGB'] >= 17) | (clab_df['HGB'] <= 10)), 'no', np.where(
                   ((clab_df['HGB'] < 17) & (clab_df['HGB'] > 10)), 'yes', 'NaN')),  # is_HGB_normal 10<N<17
               is_HCT_normal=np.where(((clab_df['HCT'] >= 52) | (clab_df['HCT'] <= 29)), 'no', np.where(
                   ((clab_df['HCT'] < 52) & (clab_df['HCT'] > 29)), 'yes', 'NaN')),  # is_HCT_normal 29<N<52
               is_PLT_normal=np.where(((clab_df['PLT'] >= 400) | (clab_df['PLT'] <= 100)), 'no', np.where(
                   ((clab_df['PLT'] < 400) & (clab_df['PLT'] > 100)), 'yes', 'NaN')),  # is_PLT_normal 100<N<400
               is_PT_normal=np.where(clab_df['PT'] >= 14, 'no', np.where(
                   clab_df['PT'] < 14, 'yes', 'NaN')),  # is_PT_normal N<14
               is_APTT_normal=np.where(clab_df['APTT'] >= 45, 'no', np.where(
                   clab_df['APTT'] < 45, 'yes', 'NaN')),  # is_APTT_normal N<45
               is_FG_normal=np.where(((clab_df['FG'] >= 450) | (clab_df['FG'] <= 150)), 'no', np.where(
                   ((clab_df['FG'] < 450) & (clab_df['FG'] > 150)), 'yes', 'NaN')),  # is_FG_normal 150<N<450
               is_PIC_normal=np.where(clab_df['PIC'] >= 0.8, 'no', np.where(
                   clab_df['PIC'] < 0.8, 'yes', 'NaN')),  # is_PIC_normal N<0.8
               is_TAT_normal=np.where(clab_df['TAT'] >= 3, 'no', np.where(
                   clab_df['TAT'] < 3, 'yes', 'NaN')),  # is_TAT_normal N<3
               is_UPRO_normal=np.where(((clab_df['U-PRO'] >= 30) | (clab_df['U-PRO'] <= 0)), 'no', np.where(
                   ((clab_df['U-PRO'] < 30) & (clab_df['U-PRO'] > 0)), 'yes', 'NaN')),  # is_U-PRO_normal 0<N<30
               is_IGG_normal=np.where(((clab_df['IGG'] >= 2000) | (clab_df['IGG'] <= 900)), 'no', np.where(
                   ((clab_df['IGG'] < 2000) & (clab_df['IGG'] > 900)), 'yes', 'NaN')),  # is_IGG_normal 900<N<2000
               is_IGA_normal=np.where(((clab_df['IGA'] >= 500) | (clab_df['IGA'] <= 80)), 'no', np.where(
                   ((clab_df['IGA'] < 500) & (clab_df['IGA'] > 80)), 'yes', 'NaN')),  # is_IGA_normal 80<N<500
               is_IGM_normal=np.where(((clab_df['IGM'] >= 400) | (clab_df['IGM'] <= 40)), 'no', np.where(
                   ((clab_df['IGM'] < 400) & (clab_df['IGM'] > 40)), 'yes', 'NaN')),  # is_IGM_normal 40<N<400
               is_RF_normal=np.where(clab_df['RF'] >= 20, 'no', np.where(
                   clab_df['RF'] < 20., 'yes', 'NaN')),  # is_RF_normal N<20
               is_C3_normal=np.where(clab_df['C3'] <= 35, 'no', np.where(
                   clab_df['C3'] > 35, 'yes', 'NaN')),  # is_C3_normal N>35
               is_C4_normal=np.where(clab_df['C4'] <= 10, 'no', np.where(
                   clab_df['C4'] > 10, 'yes', 'NaN')),  # is_C4_normal N>10
               is_DNA_normal=np.where(clab_df['DNA'] >= 8, 'no', np.where(
                   clab_df['DNA'] < 8., 'yes', 'NaN')),  # is_DNA_normal N<8
               is_DNAII_normal=np.where(clab_df['DNA-II'] >= 8, 'no', np.where(
                   clab_df['DNA-II'] < 8, 'yes', 'NaN')),  # is_DNAII_normal N<8

               )
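The long `assign` above repeats one `np.where` pattern per column. A compact sketch of the same logic with a hypothetical helper, shown on toy data with the GOT and TP thresholds from the comments above:

```python
import numpy as np
import pandas as pd

def flag_normal(s, lo=None, hi=None):
    """'yes' if each value lies strictly inside (lo, hi), 'no' if outside,
    the string 'NaN' for missing values (matching the columns above)."""
    inside = pd.Series(True, index=s.index)
    if lo is not None:
        inside &= s > lo
    if hi is not None:
        inside &= s < hi
    return np.where(s.isna(), 'NaN', np.where(inside, 'yes', 'no'))

# one-sided ranges use a single bound, e.g. GOT normal when N < 60
bounds = {'GOT': (None, 60), 'TP': (6, 8.5)}
demo = pd.DataFrame({'GOT': [24.0, 75.0, np.nan], 'TP': [7.5, 9.0, 6.0]})
flags = {f'is_{c}_normal': flag_normal(demo[c], lo, hi)
         for c, (lo, hi) in bounds.items()}
demo = demo.assign(**flags)
print(demo['is_GOT_normal'].tolist())  # ['yes', 'no', 'NaN']
```

Extending `bounds` with the remaining thresholds reproduces the full set of flag columns in a few lines.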

3.4 Join data

For analysis using patient information, the three datasets are joined to see if there are any patterns.

# cleaned A data
patients_df = pd.read_csv("../data/tsumotoa_clean.csv")
patients_df['b-day'] = pd.to_datetime(patients_df['b-day'])
patients_df['d-day'] = pd.to_datetime(patients_df['d-day'])
patients_df['f-day'] = pd.to_datetime(patients_df['f-day'])
# cleaned B data
slab_df = pd.read_csv("../data/tsumotob_clean.csv")
slab_df['Examination Date'] = pd.to_datetime(slab_df['Examination Date'])
# left join
join_df = clab_df.merge(patients_df, on='ID', how='left')

# outer join on b and c data
join_df = join_df.merge(slab_df, on='ID', how='outer')
# date formatting
join_df['b-day'] = pd.to_datetime(join_df['b-day'])
join_df['d-day'] = pd.to_datetime(join_df['d-day'])
join_df['f-day'] = pd.to_datetime(join_df['f-day'])
join_df['Examination Date'] = pd.to_datetime(join_df['Examination Date'])
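Since TSUMOTO_A should contribute at most one row per ID, the join can be guarded with pandas' `validate` argument, which raises `MergeError` if the key relationship is violated. A minimal sketch with toy frames:

```python
import pandas as pd

labs = pd.DataFrame({"ID": [1, 1, 2], "GOT": [24, 25, 30]})
patients = pd.DataFrame({"ID": [1, 2], "SEX": ["F", "M"]})

# many lab rows may map to one patient row, but never the reverse;
# validate='many_to_one' raises if patients had a duplicated ID
joined = labs.merge(patients, on="ID", how="left", validate="many_to_one")
print(joined.shape)  # (3, 3)
```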

3.4.1 Missing values

There are missing values in the data; some of them are caused by merging the datasets. For example, some patients only did the special lab tests but not the regular ones, or the other way around. For now, we are not removing any of them; missing values will be handled in the EDA section with strategies appropriate to each feature.

join_df.isna().sum()
ID                0
Date            353
GOT           11732
GPT           11713
LDH           11594
              ...  
KCT           49373
RVVT          49373
LAC           46353
Symptoms      52459
Thrombosis    37583
Length: 98, dtype: int64
# the special examination date must be later than the patient's birthday;
# drop entries where it is not

# .copy() avoids SettingWithCopyWarning on later column assignments
join_df_2 = join_df[join_df['Examination Date'] > join_df['b-day']].copy()

3.4.2 New columns

age_of_sexam is the patient's age at the time of the special examination.

from dateutil import relativedelta  # year difference between two dates

join_df_2['age_of_sexam'] = join_df_2.apply(
    lambda row: relativedelta.relativedelta(row['Examination Date'], row['b-day']).years, axis=1)
# sanity check
join_df_2.head(5)
ID Date GOT GPT LDH ALP TP ALB UA UN ... ANA ANA Pattern aCL IgA Diagnosis KCT RVVT LAC Symptoms Thrombosis age_of_sexam
1421 133382.0 1970-01-01 00:00:00.000920910 17.0 16.0 256.0 148.0 6.7 3.6 5.8 10.0 ... 1024 S 0.0 SLE - - - NaN 0.0 62
1422 133382.0 1970-01-01 00:00:00.000920924 NaN NaN NaN NaN NaN NaN NaN NaN ... 1024 S 0.0 SLE - - - NaN 0.0 62
1423 133382.0 1970-01-01 00:00:00.000921013 16.0 14.0 237.0 148.0 6.7 3.6 5.4 12.0 ... 1024 S 0.0 SLE - - - NaN 0.0 62
1424 133382.0 1970-01-01 00:00:00.000921019 17.0 14.0 253.0 148.0 6.7 3.6 6.1 13.0 ... 1024 S 0.0 SLE - - - NaN 0.0 62
1425 133382.0 1970-01-01 00:00:00.000921020 NaN NaN NaN NaN NaN NaN NaN NaN ... 1024 S 0.0 SLE - - - NaN 0.0 62

5 rows × 99 columns

Whether UA is normal depends on sex: N > 8.0 (male) and N > 6.5 (female).

# N > 8.0 (Male), N > 6.5 (Female)
join_df = join_df.assign(is_UA_normal=np.where(
    (join_df['UA'] <= 8) & (join_df['SEX'] == 'M'), 'no', np.where(
        (join_df['UA'] > 8) & (join_df['SEX'] == 'M'), 'yes', np.where(
            (join_df['UA'] <= 6.5) & (join_df['SEX'] == 'F'), 'no', np.where(
                (join_df['UA'] > 6.5) & (join_df['SEX'] == 'F'), 'yes', 'NaN')))))

4 Exploratory analysis

4.1 Individual datasets

4.1.1 Summary statistics of patient information

# Defining colors for the pie chart
colors = ['pink', 'steelblue']

# Define the ratio of gap of each fragment in a tuple
explode = (0.05, 0.05)

# plot pie chart as grouped by sex
fig = plt.figure(1)
piechart = patients_df['SEX'].value_counts().plot(kind='pie',
                                                  autopct='%1.0f%%',
                                                  colors=colors,
                                                  explode=explode,
                                                  labels=["Female", "Male"],
                                                  fontsize=12)
plt.legend(bbox_to_anchor=(-0.05, -0.1), loc='lower left', fontsize=12)
plt.suptitle('Patient gender distribution', fontsize=16)
fig.show();

Figure 1: Patient gender distribution

Female is the dominant gender among the patients (Figure 1).

fig = plt.figure(2)
plt.scatter(patients_df['d-day'], patients_df['f-day'])
plt.ylim([pd.Timestamp('1970-01-01'), pd.Timestamp('2000-12-31')])
plt.title("description date VS first date")
plt.xlabel("description date")
plt.ylabel("first date")
fig.show();
# first date always smaller than description date
# can't use one date to replace another one

Figure 2: Description date VS First date

The first date is always earlier than the description date, so we cannot use one to replace the other (Figure 2). Some years have little or no data (e.g., 1995).
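The claim that the first date never exceeds the description date can also be checked programmatically rather than only visually. A minimal sketch on a hypothetical miniature of `patients_df` (the real frame is loaded from `tsumotoa_clean.csv`):

```python
import pandas as pd

# Hypothetical example rows, each with first date before description date.
df = pd.DataFrame({
    'f-day': pd.to_datetime(['1990-01-05', '1993-07-01']),
    'd-day': pd.to_datetime(['1991-03-02', '1994-01-10']),
})

# True only if the first date never exceeds the description date.
ok = (df['f-day'] <= df['d-day']).all()
print(ok)
```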

4.1.2 Special lab information

Thrombosis level vs examination date (Figure 3). A large portion of the cases happened after 1992, and 1998 is the year with the greatest number of cases.

ax = slab_df['Examination Date'].hist(by=slab_df['Thrombosis'],bins=10)

Figure 3: Examination Date VS Thrombosis

Plot the diagnosis keywords and their relative frequencies (Figure 4).

# join keywords
key_set = {k for k, _ in positive_top10} | {k for k, _ in negative_top10}
# order by Thrombosis positive decreasing
key_ls = sorted(key_set, key=lambda x: (positive_dict[x], negative_dict[x]))

# plot keywords RELATIVE frequency
nb_positive, nb_negative = sum(
    slab_df['Thrombosis'] > 0), sum(slab_df['Thrombosis'] == 0)

hN = plt.barh(key_ls, [negative_dict[k] /
              nb_negative for k in key_ls], label='negative', color='g')
hS = plt.barh(key_ls, [-positive_dict[k] /
              nb_positive for k in key_ls], label='positive')


plt.xlim([-1, 1])
xt = plt.xticks()
n = xt[0]
s = ['%.1f' % abs(i) for i in n]
plt.xticks(n, s)

plt.legend(loc='best')
plt.axvline(0.0)
plt.show();

Figure 4: Thrombosis diagnosis keywords

Correlation among the three indices in the special lab results (Figure 5).

# transform with log10(x+1): the raw antibody values are heavily skewed
# note: correlation could be misleading because of the many zero values
# points are color-coded by thrombosis level (0 blue, 1 green, 2 magenta, 3 black)
temp = slab_df[['aCL IgG', 'aCL IgM', 'aCL IgA']].copy()
temp = temp.apply(lambda x: np.log10(x+1))

d_colors = {0: "blue", 1: "green", 2: "m", 3: "k"}
colors = [d_colors[x] for x in slab_df['Thrombosis']]
axl = pd.plotting.scatter_matrix(temp, color=colors)

Figure 5: ‘aCL IgG’, ‘aCL IgM’, ‘aCL IgA’ Correlation

Parallel coordinates for thrombosis levels.
NOTE: The documentation indicates there are three levels of thrombosis; however, the data contain four levels: 0, 1, 2, 3. Since it is plausible that 3 is simply a more severe level, this EDA treats 3 as the most severe level of thrombosis.
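The presence of the extra level can be confirmed directly from the data. A minimal sketch, using a hypothetical stand-in for `slab_df['Thrombosis']` (the real column comes from `tsumotob_clean.csv`):

```python
import pandas as pd

# Hypothetical stand-in series containing all observed levels.
thrombosis = pd.Series([0, 0, 1, 2, 3, 0, 1])

# List the distinct levels actually present.
levels = sorted(thrombosis.dropna().unique())
print(levels)
```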

# present entries are strings ('+', '-', etc.) while missing ones are float
# NaN, so a non-float type means the measurement is present
KCT_present = slab_df['KCT'].apply(lambda x: type(x)) != float
RVVT_present = slab_df['RVVT'].apply(lambda x: type(x)) != float
LAC_present = slab_df['LAC'].apply(lambda x: type(x)) != float

print("There are %s records with all three indexes present." %
      sum(KCT_present & RVVT_present & LAC_present))


temp_df = slab_df[['Thrombosis', 'KCT', 'RVVT', 'LAC']
                  ][KCT_present & RVVT_present & LAC_present].copy()
fig = plt.figure(6)

pd.plotting.parallel_coordinates(
    temp_df, 'Thrombosis', sort_labels=True, colormap='viridis')

fig.show();
There are 145 records with all three indexes present.

Figure 6: ‘KCT’, ‘RVVT’, ‘LAC’ Correlation

The heavy overlap in the parallel coordinates plot (Figure 6) indicates that the combination of the three indicators is not a good predictor of thrombosis level.

4.1.3 General lab information

fig = plt.figure(7,figsize=(12, 10))
corr = clab_df.corr(method='spearman')
sns.heatmap(corr, annot=True, annot_kws={"size": 28 / np.sqrt(len(corr))})

fig.show();

Figure 7: Heatmap of general lab index

There are some correlations in the data (Figure 7). For example, APTT and PT have a strong negative correlation, and FG and APTT have a correlation of 0.53. If we carry out further analysis on the numerical values of these measurements, these correlations need to be taken into account.
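Highly correlated pairs can be pulled out of the correlation matrix automatically instead of being read off the heatmap. A minimal sketch on a hypothetical numeric frame standing in for `clab_df` (here `PT`/`APTT` are constructed to be strongly negatively correlated, `FG` is independent noise; the 0.5 threshold is an arbitrary choice):

```python
import numpy as np
import pandas as pd

# Hypothetical data: APTT is roughly -PT plus a little noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    'PT': x,
    'APTT': -x + rng.normal(scale=0.1, size=200),
    'FG': rng.normal(size=200),
})

corr = df.corr(method='spearman')
# Keep only the upper triangle (each pair once), flatten, filter by |rho|.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
strong = pairs[pairs.abs() > 0.5]
print(strong)
```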

4.2 Joined dataset

## plot age and Thrombosis level
temp = join_df_2.pivot(columns='Thrombosis', values='age_of_sexam')
ax = temp.plot.hist(bins=20)

Figure 8: Thrombosis VS Age

Different thrombosis levels have different age distributions (Figure 8). Regardless of sample size, severe thrombosis peaks when patients are in their 30s, while level-2 thrombosis occurs at younger ages.

temp_df = join_df[['Thrombosis', 'is_GOT_normal', 'is_GPT_normal',
       'is_LDH_normal', 'is_ALP_normal', 'is_TP_normal', 'is_ALB_normal',
       'is_UN_normal', 'is_CRE_normal', 'is_TBIL_normal', 'is_TCHO_normal',
       'is_TG_normal', 'is_CPK_normal', 'is_GLU_normal', 'is_WBC_normal',
       'is_RBC_normal', 'is_HGB_normal', 'is_HCT_normal', 'is_PLT_normal',
       'is_PT_normal', 'is_APTT_normal', 'is_FG_normal', 'is_PIC_normal',
       'is_TAT_normal', 'is_UPRO_normal', 'is_IGG_normal', 'is_IGA_normal',
       'is_IGM_normal', 'is_RF_normal', 'is_C3_normal', 'is_C4_normal',
       'is_DNA_normal', 'is_DNAII_normal','is_UA_normal']
                  ]
temp_df = temp_df.dropna()  # drop rows missing any of these flags
temp_df['Thrombosis'] = temp_df['Thrombosis'].astype(float).astype(int)
temp_df = temp_df.reset_index(drop=True)

# subsets by thrombosis level
# (reset_index returns a new frame, so it must be assigned back)
temp_df1 = temp_df[temp_df['Thrombosis'].isin([1, 2, 3])].reset_index(drop=True)
temp_df0 = temp_df[temp_df['Thrombosis'] == 0].reset_index(drop=True)
temp_df2 = temp_df[temp_df['Thrombosis'] == 1].reset_index(drop=True)
temp_df3 = temp_df[temp_df['Thrombosis'] == 2].reset_index(drop=True)
temp_df4 = temp_df[temp_df['Thrombosis'] == 3].reset_index(drop=True)


fig = plt.figure(9)

pd.plotting.parallel_coordinates(
    temp_df, 'Thrombosis', sort_labels=True, colormap='viridis')
plt.xticks(rotation=45)



fig = plt.figure(10)

pd.plotting.parallel_coordinates(
    temp_df2, 'Thrombosis', sort_labels=True, colormap='viridis')
plt.xticks(rotation=45)



fig = plt.figure(11)

pd.plotting.parallel_coordinates(
    temp_df3, 'Thrombosis', sort_labels=True, colormap='viridis')
plt.xticks(rotation=45)



fig = plt.figure(12)

pd.plotting.parallel_coordinates(
    temp_df4, 'Thrombosis', sort_labels=True, colormap='viridis')
plt.xticks(rotation=45)

fig = plt.figure(13)

pd.plotting.parallel_coordinates(
    temp_df0, 'Thrombosis', sort_labels=True, colormap='viridis')
plt.xticks(rotation=45)

Parallel coordinates plot of general lab index

5 Final plots

5.1 Plot-01: Thrombosis VS Age

plt.style.use("anly503")
## plot age and Thrombosis level
join_df_2['Thrombosis'] = join_df_2['Thrombosis'].astype(float).astype(int)
temp = join_df_2.pivot(columns='Thrombosis', values='age_of_sexam')
ax = temp.plot.hist(bins=20,edgecolor = "white")
ax.figure.set_figwidth(10)
ax.figure.set_figheight(6)
plt.title("Thrombosis VS Age")
plt.xlabel("Age")
plt.legend(title="Thrombosis level")
text =plt.text(s='Note: 0 is no thrombosis, 1 is positive thrombosis,\n 2 is severe thrombosis, 3 is more severe thrombosis.',y=-350,x=-13,fontsize=14)
plt.savefig("../plot/plot-01.png", bbox_inches = 'tight');

Thrombosis VS Age

5.2 Plot-02: Thrombosis diagnosis keywords

from matplotlib.ticker import FuncFormatter

fig = plt.figure(11)
fig.set_figwidth(13)
fig.set_figheight(7)

hN = plt.barh(key_ls, [negative_dict[k] /
              nb_negative for k in key_ls], label='negative', color='g')
hS = plt.barh(key_ls, [-positive_dict[k] /
              nb_positive for k in key_ls], label='positive')

plt.xlim([-1, 1])
xt = plt.xticks()
n = xt[0]
s = ['{:,.0%}'.format(abs(i)) for i in n]
yhn = [negative_dict[k]/nb_negative for k in key_ls]
yhs = [-positive_dict[k] /nb_positive for k in key_ls]
neglabel1 = ['{:,.0%}'.format(abs(i)) for i in yhn]
neglabel = [i if i!="0%" else "" for i in neglabel1]
# neglabel = list(map(lambda x: x.replace('\b0%', ''), neglabel1))
poslabel1 = ['{:,.0%}'.format(abs(i)) for i in yhs]
poslabel = [i if i!="0%" else "" for i in poslabel1]

plt.xticks(n, s)

plt.bar_label(hN, labels=neglabel, fontsize=18, padding=8)
plt.bar_label(hS, labels=poslabel,fontsize=18,padding=8)

plt.axvline(0.0)
plt.title("Thrombosis diagnosis keywords",pad=10, fontsize = 28)
plt.xlabel("Percentage",fontsize = 24)
plt.ylabel("Diagnosis",fontsize = 24)
plt.legend(title="Thrombosis diagnosis",fontsize = 24)
plt.savefig("../plot/plot-02.png");

Figure 9: Thrombosis diagnosis keywords

5.3 Plot-03: Correlation plot

fig = plt.figure(12,figsize=(14, 12))
temp = clab_df.loc[:, clab_df.columns != 'ID']
corr = temp.corr(method='spearman')
sns.heatmap(corr, annot=True, annot_kws={"size": 28 / np.sqrt(len(corr))})
plt.title("Special lab index correlation plot",pad=10, fontsize = 28 )
plt.xlabel("Index",fontsize = 24)
plt.ylabel("Index",fontsize = 24)
plt.savefig("../plot/plot-03.png");

Figure 10: Special lab index correlation plot

5.4 Plot-04: Parallel coordinates plot

from matplotlib.patches import Rectangle


temp_df['Thrombosis'] = temp_df['Thrombosis'].astype(float).astype(int)
temp_df.reset_index(drop=True)

temp_df1= temp_df[temp_df['Thrombosis'].isin([2,3])]
temp_df1.reset_index(drop=True)

plt.figure(13,figsize=(18, 10))

pd.plotting.parallel_coordinates(
    temp_df1.sort_values(by='Thrombosis'), 'Thrombosis', sort_labels=True,color=( '#FF6B6B', '#4ECDC4'))
plt.xticks(rotation=45)
plt.title("General lab examination VS Severe thrombosis", pad=20, fontsize = 36)
plt.xlabel("Examination index",fontsize = 32)
plt.ylabel("Results",fontsize = 32)
plt.legend(title="Thrombosis level",bbox_to_anchor=(1, 0.8))
text = plt.text(s='Note: 2 is severe thrombosis, 3 is more severe thrombosis.',y=-0.8,x=-1,fontsize=24)

plt.gca().add_patch(Rectangle((4,0.9),3,0.2,
                    edgecolor='black',
                    facecolor='none',
                    lw=4))
plt.gca().add_patch(Rectangle((13,0.9),4,0.2,
                    edgecolor='black',
                    facecolor='none',
                    lw=4))

plt.savefig("../plot/plot-04.png", bbox_inches = 'tight');

Figure 11: Parallel coordinates plot of general lab index

6 Technical summary

This EDA applied several techniques, including histograms, correlation plots, and parallel coordinate plots. Histograms give an idea of how the data are distributed. For example, in Figure 3, the examination year is plotted against examination count for the various thrombosis levels, which shows in which years the different levels of thrombosis occurred. This kind of plot can reveal where high frequencies occur and what needs further examination. The correlation plots in Figure 7 and Figure 5 use Spearman rank correlation (source). These plots show the correlation among variables; highly correlated features are a sign to analyze those features further, or at least to account for them during modeling. Finally, the parallel coordinate plots used in Figure 6 and Figure 11 show cluster information for categorical variables: the color of the lines indicates group membership, while the patterns of the color-coded lines suggest the clustering.
The datasets required some manipulation and cleaning. In this EDA, the datasets were deduplicated and checked for missing values before exploration. However, missing values were treated differently for different tasks. In the correlation calculation they were removed, since most features in the correlation plot have only a few missing values. When creating the new variables that flag whether a test is in the normal range, NaN was introduced for bad entries: if a measurement was supposed to be numeric but was not, it was filled with NaN. Those NaNs are not meaningless; a missing test can indicate that doctors considered the index irrelevant to thrombosis. The missing values also prevented me from exploring my initial question about symptoms, so I shifted the analysis to the relationship between diagnosis and thrombosis.
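The NaN-for-bad-entries behavior described above can be sketched with pandas' `to_numeric` coercion (a minimal illustration on a hypothetical measurement column, not the actual cleaning code):

```python
import pandas as pd

# Hypothetical column with one bad (non-numeric) entry.
raw = pd.Series(['5.8', '6.1', '>7', '4.9'])

# errors='coerce' turns unparsable entries into NaN instead of raising,
# so bad entries become missing values.
clean = pd.to_numeric(raw, errors='coerce')
print(clean)
```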
The datasets and their documentation also carry some uncertainty. For example, the documentation indicates three levels of thrombosis, but the data contain four levels: 0, 1, 2, 3. Since it is plausible that 3 is simply a more severe level, this EDA treats 3 as the most severe level of thrombosis.